Instructions

Exercise

Melbourne property prices have taken their biggest hit since 2012, falling by almost 2 per cent in the past three months. (Jim Malo, Domain, Jul 26 2018)

This assignment explores the data provided on Melbourne house prices by Anthony Pino. The goal is to examine whether housing prices have cooled in Melbourne, and help Anthony decide whether it is time to buy a two bedroom apartment in Northcote.

Your tasks

  1. (1pt) Make a map of Melbourne showing the locations of the properties.

The code pulls a Google map centred on Melbourne and overlays points at the latitude/longitude of each property. The collection region extends beyond inner Melbourne, so the zoom is chosen to accommodate the greater Melbourne region.
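A minimal sketch of that step, with a synthetic props data frame standing in for mh (the coordinates here are made up; the column names Lattitude and Longtitude follow the spelling in the Kaggle release of this dataset). The real answer adds a basemap with ggmap::get_map("Melbourne, Australia", zoom = 9), which requires a registered API key, so a plain scatter stands in here:

```r
library(ggplot2)
# Synthetic stand-in for the property records (hypothetical coordinates);
# note the dataset spells the columns Lattitude / Longtitude.
props <- data.frame(
  Longtitude = c(144.96, 145.00, 144.90),
  Lattitude  = c(-37.80, -37.77, -37.85)
)
p <- ggplot(props, aes(x = Longtitude, y = Lattitude)) +
  geom_point(alpha = 0.3) +   # alpha helps where many points overplot
  coord_quickmap()            # approximately correct lat/long aspect ratio
```

With ggmap the same aes() mapping overlays the points on the basemap, via ggmap(map) + geom_point(...).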

  2. (2pts) Here we are going to examine the prices of 2 bedroom flats in Northcote.
    1. Filter the data to focus only on the records for Northcote units. Make a plot of Price by Date, facetted by number of bedrooms. The main thing to learn from this plot is that there are many missing values for number of bedrooms.
    2. Impute the missing values, based on the regression method (covered in class). Make sure your predicted value is an integer. Re-make the plot of Price by Date, facetted by number of bedrooms.
    3. Write a description of what you learn from the plot, particularly about the trend of 2 bedroom unit prices in Northcote.

The code filters to Northcote units. The trick in the imputation is to use the round function so that the imputed bedroom counts are integers.

# Filter to Northcote units
nc <- mh %>% filter(Suburb == "Northcote", Type == "u")
ggplot(nc, aes(x=Date, y=Price)) + 
  geom_point() + geom_smooth(se=FALSE) +
  facet_wrap(~Bedroom2, ncol=4) 

# Impute missing bedroom counts from Rooms, rounding to an integer
library(naniar)
nc_ms <- nc %>% bind_shadow()
br2 <- lm(Bedroom2 ~ Rooms, data=nc_ms)
nc_ms <- nc_ms %>%
  mutate(Bedroom2 = ifelse(is.na(Bedroom2),
    round(predict(br2, newdata=nc_ms), 0), Bedroom2))
ggplot(nc_ms, aes(x=Date, y=Price)) + 
  geom_point() + geom_smooth(se=FALSE) +
  facet_wrap(~Bedroom2, ncol=4) 

We learn that there are not many 3 bedroom units in Northcote. For 1 and 2 bedroom units the price has been fairly steady, with 2 bedroom units perhaps gradually increasing. You would need to be prepared to pay about $600,000 for a two bedroom unit; a price of around $500k could be a bargain. One bedroom units are worth around $350k.

  3. (2pts) Focusing on 2 bedroom units, we are going to explore the trend in prices for each suburb.
    1. You will need to impute the Bedroom2 variable, in the same way done in the previous question.
    2. Fit a linear model to each suburb (many models approach). Collect the model estimates, and also the model fit statistics. Make a plot of intercept vs slope. Using plotly, identify which suburb has had the largest increase in prices, and which has had the biggest decrease.
    3. Summarise the \(R^2\) for the model fits for all the suburbs. Which suburbs have the worst fitting models? Plot the Price vs Date of the best fitting model. Is the best fitting model a good fit?
    4. Write a paragraph on what you have learned about the trend in property prices across Melbourne.

The code imputes the missing bedroom values using a linear model on rooms. It then focuses only on two bedroom units, for suburbs with at least 30 records. (30 might be too high, so a lower choice could be reasonable too.)

# Impute missing bedroom counts across the whole dataset
mh_ms <- mh %>% bind_shadow()
br2 <- lm(Bedroom2 ~ Rooms, data=mh_ms)
mh_ms <- mh_ms %>%
  mutate(Bedroom2 = ifelse(is.na(Bedroom2),
    round(predict(br2, newdata=mh_ms), 0), Bedroom2))

# How many 2 bedroom unit records does each suburb have?
mh_ms %>% 
  filter(Type == "u", Bedroom2 == 2) %>% 
  count(Suburb, sort = TRUE) %>% 
  ggplot(aes(x=n)) + geom_histogram()

# Keep 2 bedroom units in suburbs with at least 30 records
keep <- mh_ms %>% filter(Type == "u", Bedroom2 == 2) %>% 
  count(Suburb, sort = TRUE) %>%
  filter(n >= 30)
library(lubridate)
mh_u <- mh_ms %>% 
  filter(Type == "u", Bedroom2 == 2, Suburb %in% keep$Suburb) %>%
  mutate(days = as.numeric(Date - ymd("2016-01-28")))

# Many models: one linear fit of Price on days per suburb
library(purrr)
by_suburb <- mh_u %>% 
  select(Suburb, Date, Price, days) %>%
  group_by(Suburb) %>% 
  nest()
by_suburb <- by_suburb %>% 
  mutate(model = purrr::map(data, ~ lm(Price ~ days, data = .)))
suburb_coefs <- by_suburb %>% 
  mutate(tidied = purrr::map(model, broom::tidy)) %>%
  unnest(tidied)
suburb_coefs <- suburb_coefs %>% 
  select(Suburb, term, estimate) %>% 
  spread(term, estimate) %>%
  rename(intercept = `(Intercept)`, slope = days)
#head(suburb_coefs)
p <- ggplot(suburb_coefs, aes(x=intercept, y=slope, 
                          label=Suburb)) +
  geom_point(alpha=0.5, size=2) 
library(plotly)
ggplotly(p)

Across most of these suburbs the price of two bedroom units has been increasing, and the increases are pretty staggering. For example, in Malvern, starting from an average price of about $1.6mil in January 2016, prices have been increasing by $1200 PER DAY!!! Windsor started off with an average price around $670k and has increased by about $800 per day! On the other hand, there are three suburbs that have seen a decrease in price. Caulfield North started at an average price of about $1mil and it has been dropping by about $200 per day.

# Collect the model fit statistics for each suburb
suburb_fit <- by_suburb %>% 
  mutate(fit = purrr::map(model, broom::glance)) %>%
  unnest(fit)
p1 <- ggplot(suburb_fit, aes(x=r.squared)) + geom_histogram()
# Only the best fitting model (Windsor) exceeds this threshold
bestfit <- suburb_fit %>% filter(r.squared > 0.08)
mh_u_sub <- mh_u %>% filter(Suburb %in% bestfit$Suburb)
p2 <- ggplot(data=mh_u_sub, aes(x=Date, y=Price)) + 
         geom_point() + geom_smooth(method="lm", se=FALSE) +
  ggtitle(unique(mh_u_sub$Suburb))
library(gridExtra)
grid.arrange(p1, p2, ncol=2)

All of the models are very weak! \(R^2\) ranges from 0.0 to 0.1, so at most 10% of the variation is explained. The best model corresponds to Windsor. Prices of 2 bedroom units in Windsor range from about $250k to $2.5mil! On average, prices have gone from $750k to $1.25mil over this two year period. However, there are still apartments selling for $500k this year.

  4. (2pts) Still focusing on apartments (units), examine the results of the auctions across suburbs, using the Method variable. This variable records the outcome of the auction: whether the property sold or not. It may be that in recent months there is a higher proportion of properties that didn't sell, which would put downward pressure on prices.
    1. Compute the counts of the levels of Method, ignoring the suburbs.
    2. The categories PI (passed in) and VB (vendor bid) indicate the property did not sell. Compute the proportion of properties in these two categories for each suburb, for each month since 2016.
    3. Plot the proportions against year/month (make a new variable, time, an integer where 1 is the first month of the data in 2016 and each subsequent month increments time by 1). Add a smoother to show the trend in these proportions. Does it look like there is an increase in units that aren't selling?
    4. Explain why the data was aggregated to month before computing the proportions.

The code computes the proportion of properties in the PI or VB categories for each month. This is the proportion of properties that did not sell.

#mh_u %>% count(Method, sort=TRUE)
mh_u_mth <- mh_u %>% 
  mutate(year = year(Date), month = month(Date)) %>%
  group_by(Suburb, year, month) %>%
  count(Method) %>%
  mutate(p = n/sum(n)) %>%                # proportion within suburb-month
  filter(Method %in% c("PI", "VB")) %>%   # the "did not sell" outcomes
  mutate(time = (year-2016)*12 + month)   # months since January 2016
p <- ggplot(mh_u_mth, aes(x=time, y=p, label=Suburb)) + 
  geom_smooth() +
  geom_point() + 
  facet_wrap(~Method)
ggplotly(p)

The properties were aggregated to month to group enough properties together to make examining proportions reasonable. The proportion of properties that were passed in at auction has been fairly stable over this time period. The proportion with vendor bids appears to have been increasing a little over recent months. There are some suburbs with really high rates of no sales.
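A toy illustration of the aggregation point, using made-up auction outcomes rather than the dataset: on a day with a single auction the proportion can only be 0 or 1, while pooling a month of auctions gives a usable rate.

```r
# Hypothetical auction outcomes: "S"/"SP" sold, "PI"/"VB" did not sell
methods_day   <- c("PI")                   # a day with a single auction
methods_month <- c("S","S","PI","VB","S","SP","S","PI","S","S")
p_day   <- mean(methods_day   %in% c("PI","VB"))  # 1: looks like nothing sold
p_month <- mean(methods_month %in% c("PI","VB"))  # 0.3: a stable estimate
```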

  5. (2pts) Fit the best model for Price that you can, for houses around Monash University.
    1. Impute the missing values for Bathroom (similarly to Bedroom2).
    2. Subset the data to these suburbs: "Notting Hill", "Glen Waverley", "Clayton", "Clayton South", "Oakleigh East", "Huntingdale", "Mount Waverley".
    3. Make a scatterplot of Price vs Date by Bedroom2 and Bathroom, with a linear model overlaid. What do you notice? Only some combinations of bedrooms and bathrooms are common. Subset your data to houses with 3-4 bedrooms and 1-2 bathrooms.
    4. Using date, rooms, bedroom, bathroom, car and landsize, build your best model for price. There are some missing values on Car and Landsize, which may be important to impute. Think about interactions as well as main effects. (There are too many missing values to use BuildingArea and YearBuilt. The other variables in the data don't make sense to use.)

This code imputes the missing bathroom values, filters to the suburbs of interest, subsets to the bedroom and bathroom combinations with enough data, and then fits some models.

# Impute missing bathroom counts from Rooms
ba2 <- lm(Bathroom ~ Rooms, data=mh_ms)
mh_ms <- mh_ms %>%
  mutate(Bathroom = ifelse(is.na(Bathroom),
    round(predict(ba2, newdata=mh_ms), 0), Bathroom))
# Houses in the suburbs around Monash University
monash <- mh_ms %>% filter(Suburb %in% c("Notting Hill", "Glen Waverley",
         "Clayton", "Clayton South", "Oakleigh East", "Huntingdale", 
         "Mount Waverley"), 
                       Type == "h") %>%
  select(Suburb, Price, Rooms, Date, Bedroom2, Bathroom, Car, Landsize) %>%
  mutate(day = as.numeric(Date)) %>%
  mutate(day = day - min(day))
# Keep the common combinations: 3-4 bedrooms, 1-2 bathrooms
monash <- monash %>% 
  filter(Bedroom2 > 2, Bedroom2 < 5, Bathroom > 0, Bathroom < 3) 
ggplot(monash, aes(x=Date, y=Price)) + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE) + 
  facet_grid(Bathroom~Bedroom2)

library(broom)
monash_fit <- lm(Price~day+Rooms+Bedroom2+Bathroom+Car+Landsize, data=monash)
tidy(monash_fit)
         term    estimate   std.error  statistic      p.value
1 (Intercept) 375121.2044 189450.4873  1.9800488 4.941296e-02
2         day   -376.2263    256.8410 -1.4648222 1.449316e-01
3       Rooms  53345.5305 179882.6259  0.2965574 7.671890e-01
4    Bedroom2 -47745.2684 183785.6009 -0.2597879 7.953614e-01
5    Bathroom 104562.8733  55160.6243  1.8956071 5.981511e-02
6         Car -94667.4427  28842.4340 -3.2822279 1.264606e-03
7    Landsize   1421.7348    146.5577  9.7008523 8.685319e-18
glance(monash_fit)
  r.squared adj.r.squared    sigma statistic      p.value df    logLik
1 0.4122985     0.3902597 294654.6  18.70784 1.993047e-16  7 -2336.512
       AIC      BIC     deviance df.residual
1 4689.023 4713.967 1.389141e+13         160
monash_fit2 <- lm(Price~day+Bathroom+Car+Landsize, data=monash)
tidy(monash_fit2)
         term    estimate   std.error statistic      p.value
1 (Intercept) 397445.3720 134337.3833  2.958561 3.554109e-03
2         day   -373.8429    253.7763 -1.473120 1.426591e-01
3    Bathroom 105225.3067  47108.2139  2.233693 2.687241e-02
4         Car -94277.2172  28572.1769 -3.299616 1.190568e-03
5    Landsize   1412.7699    139.7539 10.108983 6.222246e-19
glance(monash_fit2)
  r.squared adj.r.squared    sigma statistic      p.value df   logLik
1 0.4119556      0.397436 292915.5  28.37235 7.217381e-18  5 -2336.56
       AIC      BIC     deviance df.residual
1 4685.121 4703.829 1.389952e+13         162
monash_fit3 <- lm(Price~day+Bedroom2*Bathroom+Car+Landsize, data=monash)
tidy(monash_fit3)
               term     estimate   std.error statistic      p.value
1       (Intercept) -337844.7951 696993.9479 -0.484717 6.285400e-01
2               day    -353.1323    256.8464 -1.374877 1.710918e-01
3          Bedroom2  223012.1594 210308.3294  1.060406 2.905581e-01
4          Bathroom  531837.4080 400710.8085  1.327235 1.863217e-01
5               Car  -92786.0755  28777.1475 -3.224297 1.530910e-03
6          Landsize    1433.5867    143.2217 10.009560 1.283597e-18
7 Bedroom2:Bathroom -130577.5099 120918.2650 -1.079882 2.818197e-01
glance(monash_fit3)
  r.squared adj.r.squared    sigma statistic      p.value df    logLik
1 0.4162302     0.3943388 293667.3  19.01344 1.186591e-16  7 -2335.951
       AIC      BIC     deviance df.residual
1 4687.902 4712.846 1.379848e+13         160

The models above are a selection. The best \(R^2\) I get is about 0.41, and I am curious about the best values obtained by the students. The deviance is very high for all. Interactions don't help much.

Based on the plot, we would expect the properties with one bathroom to have negative trends, while properties with 2 bathrooms have generally increasing prices for 3 bedrooms but not for 4. This suggests an interaction term should help.

There should be some interpretation of the final model reported. For example, the models above suggest a negative relationship for some levels of bedroom/bathroom. Surprisingly, having a car space decreases the price; an interaction term here could be useful to understand whether this is contingent on the number of bedrooms or bathrooms. Generally, the larger the land size, the higher the price.

It would be good to see in the report that they break down the model, substituting some values for the explanatory variables and predicting the price. This helps in understanding the interplay of the different variables.
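As an illustration of that breakdown, here is a hand-computed prediction from the monash_fit2 coefficients printed above (the substituted values, day = 365, 2 bathrooms, 2 car spaces, 600 square metres of land, are hypothetical choices):

```r
# Coefficients from the tidy(monash_fit2) output above
b <- c(intercept = 397445.3720, day = -373.8429,
       Bathroom  = 105225.3067, Car = -94277.2172, Landsize = 1412.7699)
# A hypothetical house one year into the data: 2 bathrooms, 2 car spaces,
# 600 m^2 of land (the leading 1 multiplies the intercept)
x <- c(1, day = 365, Bathroom = 2, Car = 2, Landsize = 600)
pred <- sum(b * x)   # about $1.13mil
```

In a report this is more conveniently done with predict(monash_fit2, newdata = ...), varying one input at a time to see each variable's contribution.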

Grading

One point for overall report organisation and readability.